Abstract. We present MultiGain, a tool to synthesize strategies for
Markov decision processes (MDPs) with multiple mean-payoff objectives.
Our models are described in PRISM, and our tool uses the existing interface
and simulator of PRISM. Our tool extends PRISM by adding novel algorithms
for multiple mean-payoff objectives, and also provides features such
as (i) generating strategies and exploring them for simulation, and checking
them with respect to other properties; and (ii) generating an approximate
Pareto curve for two mean-payoff objectives. In addition, we present a new
practical algorithm for the analysis of MDPs with multiple mean-payoff
objectives under memoryless strategies.


1 Introduction 

Markov decision processes (MDPs) are the de facto model for analysis of probabilistic
systems with non-determinism EH , with a wide range of applications [5]. In each
state of an MDP, a controller chooses one of several actions (the nondeterministic
choices), and the current state and action give a probability distribution over the
successor states. One classical objective used to study quantitative properties of
systems is the limit-average (or mean-payoff) objective, where a reward (or cost) is
associated with each transition and the objective assigns to every run the average
of the rewards over the run. MDPs with single mean-payoff objectives have been
well studied in the literature (see, e.g., [13]). However, in many modeling domains,
there is not a single goal to be optimized, but multiple, potentially interdependent
and conflicting goals. For example, in designing a computer system, the goal is to
maximize average performance while minimizing average power consumption. Similarly,
in an inventory management system, the goal is to optimize several dependent
costs for maintaining each kind of product. The complexity of MDPs with multiple
mean-payoff objectives was studied in [6].

In this paper we present MultiGain, which is, to the best of our knowledge, the
first tool for synthesis of controller strategies in MDPs with multiple mean-payoff
objectives. The MDPs and the mean-payoff objectives are specified in the well-known
PRISM modelling language. Our contributions are as follows: (1) we extend
PRISM with novel algorithms for multiple mean-payoff objectives from [6]; (2) we
build on the results of [6] to synthesize strategies, explore them by simulation,
and check them with respect to other properties (as done in PRISM-games 0 );
and (3) for the important special case of two mean-payoff objectives we provide a
feature to visualize the approximate Pareto curve (where the Pareto curve represents
the "trade-off" curve and consists of solutions that are not strictly dominated
by any other solution). Finally, we present a new practical approach for analysis
of MDPs with multiple mean-payoff objectives under memoryless strategies:
previously an NP bound was shown in [7] by guessing all bottom strongly connected
components (BSCCs) of the MDP graph for a memoryless strategy, which gave
an exponential enumerative algorithm; in contrast, we present a linear reduction
to solving a boolean combination of linear constraints (which is a special class of
mixed integer linear programming where the integer variables are binary).

2 Definitions 

MDPs and strategies. An MDP $G = (S, A, \mathit{Act}, \delta)$ consists of (i) a finite set $S$ of
states; (ii) a finite set $A$ of actions; (iii) an action enabledness function
$\mathit{Act} : S \to 2^A \setminus \{\emptyset\}$ that assigns to each state $s$ the set $\mathit{Act}(s)$ of
actions enabled at $s$; and (iv) a transition function $\delta : S \times A \to \mathit{dist}(S)$ that,
given a state $s$ and an action $a \in \mathit{Act}(s)$, gives a probability distribution over the
successor states ($\mathit{dist}(S)$ denotes all probability distributions over $S$). W.l.o.g. we
assume that every action is enabled in exactly one state, and we denote this state
$\mathit{Src}(a)$. Thus, we will assume that $\delta : A \to \mathit{dist}(S)$. Strategies describe how to
choose the next action given a finite path (of state and action pairs) in the MDP.
A strategy consists of a set of memory elements to remember the history of the paths.
The memory elements are updated stochastically in each transition, and the next
action is chosen probabilistically (among enabled actions) based on the current state
and current memory [6]. A strategy is memoryless if it depends only on the current state.
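
The definition above can be illustrated with a minimal Python sketch. The class name `MDP`, its field names, and the toy model are assumptions made for illustration only; they are not part of MultiGain or PRISM. The sketch relies on the assumption that every action is enabled in exactly one state, so that the transition function can be indexed by actions alone.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MDP:
    """Hypothetical container for an MDP G = (S, A, Act, delta)."""
    states: List[str]
    act: Dict[str, List[str]]            # Act(s): actions enabled in s
    delta: Dict[str, Dict[str, float]]   # delta(a): distribution over successors

    def src(self, action: str) -> str:
        """Return Src(a), the unique state in which `action` is enabled."""
        owners = [s for s in self.states if action in self.act[s]]
        assert len(owners) == 1, "each action must be enabled in exactly one state"
        return owners[0]

    def validate(self) -> None:
        """Check that Act(s) is non-empty and that each delta(a) sums to 1."""
        for s in self.states:
            assert self.act[s], f"state {s} has no enabled action"
            for a in self.act[s]:
                assert abs(sum(self.delta[a].values()) - 1.0) < 1e-9


# Toy example: in s0, 'stay' loops and 'go' moves to s1 with probability 0.5.
toy = MDP(
    states=["s0", "s1"],
    act={"s0": ["stay", "go"], "s1": ["loop"]},
    delta={"stay": {"s0": 1.0}, "go": {"s0": 0.5, "s1": 0.5}, "loop": {"s1": 1.0}},
)
toy.validate()
```

Indexing `delta` by action name mirrors the convention that the transition function maps actions directly to distributions.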

Multiple mean-payoff objectives. A single mean-payoff objective consists of a
reward function $r$ that assigns a real-valued reward $r(s, a)$ to every state $s$ and
action $a$ enabled in $s$, and the mean-payoff objective $\mathit{mp}(r)$ assigns to every infinite
path (or run) the long-run average of the rewards of the path, i.e., for an infinite path
$\pi = (s_0 a_0 s_1 a_1 \ldots)$ we have
$\mathit{mp}(r)(\pi) = \liminf_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} r(s_i, a_i)$.
In multiple mean-payoff objectives, there are $k$ reward functions $r_1, r_2, \ldots, r_k$, and
each reward function $r_i$ defines the respective mean-payoff objective $\mathit{mp}(r_i)$. Given a
strategy $\sigma$, we denote by $\mathbb{E}^{\sigma}_{s}[\cdot]$ the expectation measure of the strategy given a
starting state $s$. Thus for a mean-payoff objective $\mathit{mp}(r)$, the expected mean-payoff is
$\mathbb{E}^{\sigma}_{s}[\mathit{mp}(r)]$.
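
As a small worked illustration of the long-run average, the following Python sketch computes prefix averages of a reward sequence and the mean payoff of an ultimately periodic path. The function names and the example rewards are hypothetical. For a path that eventually repeats a fixed cycle, the limit exists and equals the average reward on the cycle, since the finite prefix is averaged away.

```python
from typing import Sequence


def prefix_average(rewards: Sequence[float], n: int) -> float:
    """Average of the first n rewards r(s_0, a_0), ..., r(s_{n-1}, a_{n-1})."""
    return sum(rewards[:n]) / n


def lasso_mean_payoff(cycle: Sequence[float]) -> float:
    """Mean payoff of a path that eventually repeats `cycle` forever.

    The finite prefix before the cycle does not affect the long-run average.
    """
    return sum(cycle) / len(cycle)


# Path with prefix rewards [5, 5] followed by the cycle [1, 0] repeated forever;
# `finite` is a long finite prefix of that path.
prefix, cycle = [5.0, 5.0], [1.0, 0.0]
finite = prefix + cycle * 5000
```

The prefix averages of `finite` approach `lasso_mean_payoff(cycle)` as the prefix length grows.
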

Synthesis questions. The relevant questions in analysis of MDPs with multiple
objectives are as follows: (1) (Existence). Given an MDP with $k$ reward functions,
starting state $s_0$, and a vector $v = (v_1, v_2, \ldots, v_k)$ of $k$ real values, the existence
question asks whether there exists a strategy $\sigma$ such that for all $1 \le i \le k$ we have
$\mathbb{E}^{\sigma}_{s_0}[\mathit{mp}(r_i)] \ge v_i$. (2) (Synthesis). If the answer to the existence question is yes,
the synthesis question asks for a witness strategy to satisfy the existence question.
An optimization question related to multiple objectives is the computation of the
Pareto curve (or the trade-off curve), where the Pareto curve consists of vectors $v$
such that the answer to the existence question is yes, and for all vectors $v'$ that
strictly dominate $v$ (i.e., $v'$ is at least $v$ in all dimensions and strictly greater in at
least one dimension) the answer to the existence question is no.
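
The dominance relation can be made concrete with a short Python sketch that filters a finite list of achievable vectors down to those not strictly dominated by another vector in the list. This is only an illustration on a finite sample (the actual Pareto curve ranges over all achievable vectors), and the function names are hypothetical.

```python
from typing import List, Tuple

Vector = Tuple[float, ...]


def strictly_dominates(u: Vector, v: Vector) -> bool:
    """True iff u is at least v in all dimensions and strictly greater in one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(achievable: List[Vector]) -> List[Vector]:
    """Keep the vectors not strictly dominated by another vector in the list."""
    return [v for v in achievable
            if not any(strictly_dominates(u, v) for u in achievable)]


points = [(1.0, 3.0), (2.0, 2.0), (1.0, 1.0), (3.0, 0.5), (2.0, 1.0)]
front = pareto_front(points)
```

Here `(1.0, 1.0)` and `(2.0, 1.0)` are removed because `(2.0, 2.0)` strictly dominates both.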

3 Algorithms and Implementation 

We first recall the existing results for MDPs with multiple mean-payoff objectives [6],
and then describe our implementation and extensions. Before presenting the existing
results, we first recall the notion of maximal end-components in MDPs.


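\begin{align}
\mathbf{1}_{s_0}(s) + \sum_{a \in A} y_a \cdot \delta(a)(s) &= \sum_{a \in \mathit{Act}(s)} y_a + y_s && \text{for all } s \in S \tag{1}\\
\sum_{s \in S} y_s &= 1 \tag{2}\\
\sum_{s \in C} y_s &= \sum_{a \in A \cap C} x_a && \text{for all MECs } C \text{ of } G \tag{3}\\
\sum_{a \in A} x_a \cdot \delta(a)(s) &= \sum_{a \in \mathit{Act}(s)} x_a && \text{for all } s \in S \tag{4}\\
\sum_{a \in A} x_a \cdot r_i(a) &\ge v_i && \text{for all } 1 \le i \le k \tag{5}
\end{align}

Fig. 1: System L of linear inequalities (here $\mathbf{1}_{s_0}(s)$ is 1 if $s = s_0$, and 0 otherwise).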


Maximal end-components. A pair $(T, B)$ with $\emptyset \ne T \subseteq S$ and $B \subseteq \bigcup_{t \in T} \mathit{Act}(t)$
is an end component of $G$ if (1) for all $a \in B$, whenever $\delta(a)(s') > 0$ then $s' \in T$;
and (2) for all $s, t \in T$ there is a finite path from $s$ to $t$ such that all states and
actions that appear in the path belong to $T$ and $B$, respectively. An end component
$(T, B)$ is a maximal end component (MEC) if it is maximal w.r.t. pointwise subset
ordering. An MDP is unichain if for all $B \subseteq A$ satisfying $B \cap \mathit{Act}(s) \ne \emptyset$ for any
$s \in S$ we have that $(S, B)$ is a MEC. Given an MDP, we denote by $S_{MEC}$ the set of
states $s$ that are contained within a MEC.
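
The two conditions of an end component can be checked directly on a small explicit model. The Python sketch below is a hypothetical helper, not MultiGain code. It reads condition (2) as requiring paths with at least one action, which excludes degenerate pairs with an empty action set; that reading is an assumption.

```python
from typing import Dict, Set


def is_end_component(T: Set[str], B: Set[str],
                     act: Dict[str, Set[str]],
                     delta: Dict[str, Dict[str, float]]) -> bool:
    """Check conditions (1) and (2) for a candidate end component (T, B)."""
    if not T or not B <= set().union(*(act[t] for t in T)):
        return False
    # (1) every action in B stays inside T.
    for a in B:
        if any(p > 0 and s not in T for s, p in delta[a].items()):
            return False
    # (2) every state of T reaches every state of T using only actions in B
    #     (paths with at least one action: an assumption of this sketch).
    succ = {t: {s for a in act[t] & B for s, p in delta[a].items() if p > 0}
            for t in T}
    for start in T:
        seen: Set[str] = set()
        stack = [start]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if seen != T:
            return False
    return True


act = {"s0": {"stay", "go"}, "s1": {"loop"}}
delta = {"stay": {"s0": 1.0}, "go": {"s0": 0.5, "s1": 0.5}, "loop": {"s1": 1.0}}
```

In this toy model, `({"s0"}, {"stay"})` and `({"s1"}, {"loop"})` are end components, while any pair containing `go` together with only `s0` is not, because `go` can leave `s0`.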

Result from [6]. The results of [6] showed that (i) the existence question can be
answered in polynomial time, by reduction to linear programming; (ii) if there exists
a strategy for the existence problem, then there exists a witness strategy with only
two memory states. It also established that if the MDP is unichain, then memoryless
strategies are sufficient. The polynomial-time algorithm is as follows: it was shown
in [6] that the answer to the existence problem is yes iff there exists a non-negative
solution to the system of linear inequalities given in Fig. 1.
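
To make the system of Fig. 1 concrete, the following Python sketch checks whether a given candidate assignment satisfies constraints (1) to (5) on a small explicit model. It is a verifier for a proposed solution, not an LP solver and not MultiGain's implementation. The function name, the data layout, and the reading of constraint (2) as a sum over all states are assumptions of this sketch.

```python
from typing import Dict, List, Set, Tuple

TOL = 1e-9  # numerical tolerance for comparing floats


def satisfies_system(act: Dict[str, List[str]],
                     delta: Dict[str, Dict[str, float]],
                     mecs: List[Tuple[Set[str], Set[str]]],
                     rewards: List[Dict[str, float]],
                     v: List[float], s0: str,
                     x: Dict[str, float], y: Dict[str, float]) -> bool:
    """Check a candidate assignment against Eqs. (1)-(5) of Fig. 1.

    `y` holds both the action variables y_a and the state variables y_s,
    so state and action names must be distinct.
    """
    states = list(act)
    actions = [a for s in states for a in act[s]]

    def inflow(z: Dict[str, float], s: str) -> float:
        # Sum over all actions a of z_a * delta(a)(s).
        return sum(z[a] * delta[a].get(s, 0.0) for a in actions)

    if min(list(x.values()) + list(y.values())) < -TOL:
        return False  # only non-negative solutions count
    eq1 = all(abs((1.0 if s == s0 else 0.0) + inflow(y, s)
                  - sum(y[a] for a in act[s]) - y[s]) <= TOL for s in states)
    # Eq. (2), read here as a sum over all states (an assumption).
    eq2 = abs(sum(y[s] for s in states) - 1.0) <= TOL
    eq3 = all(abs(sum(y[s] for s in T) - sum(x[a] for a in B)) <= TOL
              for T, B in mecs)
    eq4 = all(abs(inflow(x, s) - sum(x[a] for a in act[s])) <= TOL
              for s in states)
    eq5 = all(sum(x[a] * r[a] for a in actions) >= vi - TOL
              for r, vi in zip(rewards, v))
    return eq1 and eq2 and eq3 and eq4 and eq5


# One state with two self-loops; playing each half of the time yields (0.5, 0.5).
act = {"s": ["a", "b"]}
delta = {"a": {"s": 1.0}, "b": {"s": 1.0}}
mecs = [({"s"}, {"a", "b"})]
rewards = [{"a": 1.0, "b": 0.0}, {"a": 0.0, "b": 1.0}]
x = {"a": 0.5, "b": 0.5}
y = {"a": 0.0, "b": 0.0, "s": 1.0}
```

With these values the assignment witnesses the vector (0.5, 0.5); the same assignment fails constraint (5) for the vector (0.6, 0.5).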

Syntax and semantics. Our tool accepts PRISM MDP models as input, see
PP for details. The multi-objective properties are expressed as multi(list) or
mlessmulti(list), where list is a comma-separated list of mean-payoff reward
properties, which can be boolean, e.g. R{"r1"}>=0.5 [S], and in the case of multi also
numerical, e.g. R{"r2"}min=? [S]. In the reward properties, S stands for steady-state,
following PRISM's terminology.

If all properties in the list are boolean, the multi-objective property multi(list)
is also boolean and is true iff there is a strategy under which all given reward
properties in the list are simultaneously satisfied. If there is a single numerical query,
the multi-objective query intuitively asks for the maximal achievable reward of the
numerical reward query, subject to the restriction given by the boolean queries. We
also allow two numerical queries; in such a case MultiGain generates a Pareto curve.
The semantics of mlessmulti follows the same pattern, the only difference being
that only memoryless (randomised) strategies are considered. The reason we
do not allow numerical reward properties in mlessmulti is that the supremum among
all memoryless strategies might not be realised.

Implementation of existence question. We have implemented the algorithm
of [6]. Our implementation takes as input an MDP with multiple mean-payoff
objectives and a value vector $v$, and computes the linear inequalities of Fig. 1 or a
mixed integer linear programming (MILP) extension in case of memoryless
strategies. The system of linear inequalities is solved with LPsolve [2] or Gurobi [3].


Implementation of the synthesis question. We now describe how to obtain
witness strategies. Assume that the linear program from Fig. 1 has a solution,
where a solution to a variable $z$ is denoted by $\bar{z}$. We construct a new linear program,
comprising Eq. (1) together with the equations $y_s = \sum_{a \in \mathit{Act}(s)} \bar{x}_a$ for $s \in S_{MEC}$.

Let $\hat{z}$ denote a solution to variables $z$ in this linear program. The stochastic-update
strategy is defined to have 2 memory states ("transient" and "recurrent"),
with the transition function defined to be
$\sigma_t(s)(a) = \hat{y}_a / \sum_{b \in \mathit{Act}(s)} \hat{y}_b$ and
$\sigma_r(s)(a) = \bar{x}_a / \sum_{b \in \mathit{Act}(s)} \bar{x}_b$, and the probability of switching from the "transient" to
the "recurrent" state upon entering $s$ being $\hat{y}_s / (\sum_{a \in \mathit{Act}(s)} \hat{y}_a + \hat{y}_s)$. The correctness of
the witness construction follows from [6].
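
The construction of the two-memory strategy from the two solutions can be sketched in Python as follows. Here `x_bar` stands for the solution of the first linear program and `y_hat` for the solution of the second one; these names, the function names, and the uniform fallback used when a denominator is zero are assumptions of this sketch (the text does not say how such states are treated).

```python
from typing import Dict, List, Tuple

Dist = Dict[str, float]


def normalise(weights: Dist) -> Dist:
    """Scale non-negative weights to a distribution."""
    total = sum(weights.values())
    if total == 0:
        # No mass on any enabled action: fall back to uniform
        # (an arbitrary, illustrative choice).
        return {a: 1.0 / len(weights) for a in weights}
    return {a: w / total for a, w in weights.items()}


def witness_strategy(act: Dict[str, List[str]], x_bar: Dist, y_hat: Dist
                     ) -> Tuple[Dict[str, Dist], Dict[str, Dist], Dist]:
    """Return (sigma_t, sigma_r, switch) for the 2-memory strategy.

    `y_hat` holds both action variables y_a and state variables y_s.
    """
    sigma_t, sigma_r, switch = {}, {}, {}
    for s, enabled in act.items():
        sigma_t[s] = normalise({a: y_hat[a] for a in enabled})
        sigma_r[s] = normalise({a: x_bar[a] for a in enabled})
        denom = sum(y_hat[a] for a in enabled) + y_hat[s]
        switch[s] = y_hat[s] / denom if denom > 0 else 0.0
    return sigma_t, sigma_r, switch


# Toy model: from s0, 'go' reaches s1 with probability 0.5; s1 is absorbing.
act = {"s0": ["stay", "go"], "s1": ["loop"]}
x_bar = {"stay": 0.0, "go": 0.0, "loop": 1.0}
y_hat = {"stay": 0.0, "go": 2.0, "loop": 0.0, "s0": 0.0, "s1": 1.0}
sigma_t, sigma_r, switch = witness_strategy(act, x_bar, y_hat)
```

In the example the strategy always plays `go` while in the "transient" memory state and switches to "recurrent" with probability 1 on entering `s1`.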

MILP for memoryless strategies. For memoryless strategies, the current upper
bound is NP [7] and the previous algorithm enumerates all possible BSCCs
under a memoryless strategy. We present a polynomial-time reduction to solving
a boolean combination of linear constraints, which can be easily encoded using
MILP with binary variables [15]. The key requirement for memoryless strategies
is that a state can either be recurrent or transient. For the existence question
restricted to memoryless strategies we modify the linear constraints from
Fig. 1 as follows: (i) we add constraints, for all states $s$ and actions $b \in \mathit{Act}(s)$:
$y_b > 0 \Rightarrow (x_b > 0 \lor \sum_{a \in \mathit{Act}(\mathit{Src}(b))} x_a = 0)$; (ii) we replace constraint (3) from
Fig. 1 by constraints that for all states $s$: $y_s = \sum_{a \in \mathit{Act}(s)} x_a$. The constraint (ii)
is a strengthening of constraint (3), as the above constraint implies constraint (3).
The intuition behind the additional constraint is as follows: $\sum_{a \in \mathit{Act}(\mathit{Src}(b))} x_a = 0$
represents transient states, where there is no restriction on $y_b$; otherwise $y_b$ can
be positive only if $x_b$ is positive. The witness strategy $\sigma$ is as follows: let $\bar{z}$
denote a solution to the MILP for the memoryless strategy question; for a state $s$, if
$\sum_{a \in \mathit{Act}(s)} \bar{x}_a = 0$ (i.e., $s$ is transient), then $\sigma(s)(a) = \bar{y}_a / \sum_{b \in \mathit{Act}(s)} \bar{y}_b$, otherwise
$\sigma(s)(a) = \bar{x}_a / \sum_{b \in \mathit{Act}(s)} \bar{x}_b$. Further details are in the Appendix.
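
The two modifications for memoryless strategies can be checked on a candidate assignment with a few lines of Python. The sketch below verifies only the added constraint (i) and the replacement constraint (ii); it does not encode them as a MILP, and the function name and tolerance are assumptions.

```python
from typing import Dict, List

TOL = 1e-9  # values at most TOL are treated as zero


def memoryless_constraints_hold(act: Dict[str, List[str]],
                                x: Dict[str, float],
                                y: Dict[str, float]) -> bool:
    """Check constraints (i) and (ii) for a candidate assignment.

    `y` holds both action variables y_b and state variables y_s.
    """
    for s, enabled in act.items():
        # Total frequency of s; for b in Act(s) this is the sum over Act(Src(b)).
        freq = sum(x[a] for a in enabled)
        # (ii) y_s equals the total frequency of the actions enabled in s.
        if abs(y[s] - freq) > TOL:
            return False
        # (i) y_b > 0 implies x_b > 0, unless s is transient (freq == 0).
        for b in enabled:
            if y[b] > TOL and not (x[b] > TOL or freq <= TOL):
                return False
    return True


act = {"s0": ["stay", "go"], "s1": ["loop"]}
good_x = {"stay": 0.0, "go": 0.0, "loop": 1.0}
good_y = {"stay": 0.0, "go": 2.0, "loop": 0.0, "s0": 0.0, "s1": 1.0}
bad_x = {"stay": 0.5, "go": 0.0, "loop": 0.5}
bad_y = {"stay": 0.0, "go": 1.0, "loop": 0.0, "s0": 0.5, "s1": 0.5}
```

The second assignment violates (i): `s0` is recurrent (its frequency is positive), yet `go` has positive `y` while its frequency `x` is zero, so a memoryless strategy could not realise it.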

Approximate Pareto curve for two objectives. To generate a Pareto curve, we 
successively compute solutions to several linear programs for a single mean-payoff 
objective, where every time the objective is obtained as a weighted sum of the 
objectives for which the Pareto curve is generated. The weights are selected in a 
way similar to uni, allowing us to obtain the approximation of the curve. 
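
The weighted-sum idea can be sketched in Python as follows. The oracle `optimise(w)` stands in for one single-objective linear program; here it is replaced by a maximisation over a hypothetical finite set of achievable points. The rule for choosing new weights (a vector perpendicular to the segment between two neighbouring points already found) is a common scheme and an assumption of this sketch, not necessarily the exact rule used by MultiGain.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def approximate_pareto(optimise: Callable[[Point], Point],
                       rounds: int = 10) -> List[Point]:
    """Collect Pareto points by repeatedly optimising weighted sums.

    `optimise(w)` must return an achievable point p maximising
    w[0] * p[0] + w[1] * p[1].
    """
    found = sorted({optimise((1.0, 0.0)), optimise((0.0, 1.0))})
    for _ in range(rounds):
        new_points = set()
        for (x1, y1), (x2, y2) in zip(found, found[1:]):
            # Weight vector perpendicular to the segment between two known points.
            w = (y1 - y2, x2 - x1)
            p = optimise(w)
            if p not in found:
                new_points.add(p)
        if not new_points:
            break  # no segment can be refined further
        found = sorted(set(found) | new_points)
    return found


# Hypothetical achievable points standing in for the LP feasible region.
candidates = [(0.0, 3.0), (1.0, 2.8), (2.0, 2.0), (2.8, 1.0), (3.0, 0.0), (1.0, 1.0)]


def demo_optimise(w: Point) -> Point:
    return max(candidates, key=lambda p: w[0] * p[0] + w[1] * p[1])


curve = approximate_pareto(demo_optimise)
```

Each round refines the current piecewise-linear approximation; the dominated point `(1.0, 1.0)` is never returned by the oracle and so never enters the curve.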

Unlike the PRISM implementation for multi-objective cumulative rewards,
our tool is able to generate the Pareto curve for objectives of the form
multi(R{"r1"}max=? [S], R{"r2"}max=? [S], R{"r3"}>=0.5 [S]) where the
objectives to be optimised are subject to restrictions given by other rewards.

Features of our tool. In summary, our tool extends PRISM by developing
algorithms to solve MDPs with multiple mean-payoff objectives. Along with the
algorithm from [6] we have also implemented a visual representation of the Pareto
curve for two-dimensional objectives. The implementation utilises a multi-objective
visualisation available in PRISM for cumulative reward and LTL objectives.

In addition, we adapted a feature from PRISM-games [5] which allows the user 
to generate strategies, so that they can be explored and investigated by simulation. 
A product (Markov chain) of an MDP and a strategy can be constructed, allowing 
the user to employ it for verification of other properties. 



Fig. 2: Screenshot of MultiGain (largely inheriting from the PRISM GUI). 

The tool is available at http://qav.cs.ox.ac.uk/multigain/, and the source
code is provided under GPL. For licensing reasons, Gurobi is not included with the
download, but it can be added manually by following the provided steps.

4 Experimental Results: Case Studies 

We have evaluated our tool on two standard case studies, adapted from pQ, and also 
mention other applications where our tool could be used. 

Dining philosophers is a case study based on the algorithm of [5], which extends
Lehmann and Rabin's randomised solution [12] to the dining philosophers problem
so that there is no requirement for fairness assumptions. The constant N gives
the number of philosophers. We use two reward structures, think and eat, for the
number of philosophers currently thinking and eating, respectively.

Randomised Mutual Exclusion models a solution to the mutual exclusion problem 
by nu. The parameter N gives the number of processes competing for the access to 
the critical section. Here we defined reward structures try and crit for the number
of processes that are currently trying to access the critical section and the number
of processes currently in it, respectively (the latter never being more than 1).

Evaluation. The statistics for some of our experiments are given in Table 1 (the
complete results are available from the tool's website). The experiments were run
on a 2.66GHz PC with 4GB RAM, the LP solver used was Gurobi and the timeout
("t/o") was set to 2 hours. We observed that our approach scales to mid-size models,
the main limitation being the LP solver.

| model | param. N | property (A: multi(...), B: mlessmulti(...)) | MDP states | LP vars (binary) | LP rows | total time (s) | solving time (s) | value |
|-------|----------|----------------------------------------------|------------|------------------|---------|----------------|------------------|-------|
| phil  | 3        | A: R{"think"}max=?, R{"eat"}>=0.3            | 956        | 6344             | 1915    | 0.23           | 0.08             | 2.119 |
| phil  | 3        | B: R{"think"}>=2.11, R{"eat"}>=0.3           | 956        | 12553 (6344)     | 11773   | 209.9          | 209.7            | true  |
| phil  | 3        | B: R{"think"}>=2.12, R{"eat"}>=0.3           | 956        | 12553 (6344)     | 11773   | 20.9           | 20.7             | false |
| phil  | 4        | A: R{"think"}max=?, R{"eat"}>=1              | 9440       | 80368            | 18883   | 4.4            | 3.8              | 2.429 |
| phil  | 5        | A: R{"think"}max=?, R{"eat"}>=1              | 93068      | 967168           | 186139  | 616.0          | 606.4            | 3.429 |
| mutex | 3        | A: R{"try"}max=? [S], R{"crit"}>=0.2         | 27766      | 119038           | 55535   | 214.9          | 212.7            | 2.679 |
| mutex | 4        | A: R{"try"}max=? [S], R{"crit"}>=0.3         | 668836     | 3010308          | 1337675 | t/o            | t/o              | t/o   |
| mutex | 4        | A: R{"try"}>=3.5 [S], R{"crit"}>=0.3         | 668836     | 3010308          | 1337676 | 4126           | 4073             | true  |

Table 1: Experimental results. For space reasons, the [S] argument to R is omitted.

Other applications. We mention two applications which are solved using MDPs
with multiple mean-payoff objectives. (A) The problem of synthesis from incompatible
specifications was considered in [16]. Given a set of specifications $\varphi_1, \varphi_2, \ldots, \varphi_k$
that cannot be all satisfied together, the goal is to synthesize a system such that
for all $1 \le i \le k$ the distance to specification $\varphi_i$ is at most $v_i$. In adversarial
environments the problem reduces to games and for probabilistic environments to
MDPs, with multiple mean-payoff objectives [16]. (B) The problem of synthesis of
steady state distributions for ergodic MDPs was considered in [4]. The problem can
be modeled with multiple mean-payoff objectives by considering indicator reward
functions $r_s$, for each state $s$, that assign reward 1 to every action enabled in $s$
and 0 to all other actions. The steady state distribution synthesis question of [4]
then reduces to the existence question for multiple mean-payoff MDPs.
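
The indicator reward functions are simple to construct explicitly. The following Python sketch builds one reward function per state for a toy model; the function name and data layout are hypothetical. Since the reward for a state is collected exactly when an action of that state is played, its expected mean payoff equals the long-run frequency of the state, so bounds on these objectives constrain the steady state distribution.

```python
from typing import Dict, List


def indicator_rewards(act: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """For each state s, build r_s with r_s(a) = 1 if a is enabled in s, else 0."""
    all_actions = [a for enabled in act.values() for a in enabled]
    return {s: {a: 1.0 if a in act[s] else 0.0 for a in all_actions}
            for s in act}


act = {"s0": ["stay", "go"], "s1": ["loop"]}
r = indicator_rewards(act)
```

For every action exactly one of the indicator functions assigns reward 1, so the rewards of all states sum to 1 on each action.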

Concluding remarks. We presented the first tool for analysis of MDPs with multiple
mean-payoff objectives. The limiting factor is the LP solver, and so an interesting
direction would be to extend the results of m to multiple objectives. 
5 Appendix


Argument for mixed ILP for memoryless strategies. The main idea of the
correctness argument of the Boolean combination of linear constraints is as follows.
Given a memoryless strategy $\sigma$, once the strategy is fixed we obtain a Markov chain
with two types of states, transient states and recurrent states. For recurrent states
the variables $x$ represent the flow equations and, intuitively, the variable $x_a$ denotes
the long-run average frequency of the action $a$. The constraint $\sum_{a \in \mathit{Act}(s)} x_a = 0$
represents that a state $s$ is transient. The variables $y_a$ are used to
determine the probabilities to reach the recurrent classes. For an action $b$, if $\mathit{Src}(b)$
is transient, then for the first additional constraint for memoryless strategies, since
$\sum_{a \in \mathit{Act}(\mathit{Src}(b))} x_a = 0$ is satisfied, it follows that there is no restriction on $y_b$. Once a
recurrent class is reached, it suffices to switch deterministically to the strategy of the
recurrent class. However, this is encoded slightly differently: we allow the strategy to
continue before switching, but only in the recurrent class, which is enforced by the
constraint $y_b > 0 \Rightarrow x_b > 0$. For a state $s$ in a recurrent class, $\sum_{a \in \mathit{Act}(s)} x_a$ denotes the
long-run average frequency of $s$, and the constraint (ii) requires that the long-run average
frequency of $s$ coincides with the probability $y_s$ to reach $s$ as the first state of the
recurrent class. Note that for a recurrent class $Z$, the sum $\sum_{s \in Z} y_s$ represents the
probability that the recurrent class $Z$ is reached. We now describe the details of the
witness strategy constructions.

Strategy from solution. Given a solution to the mixed ILP, let $\bar{z}$ denote the solution
to variable $z$. We construct a witness memoryless strategy as follows: for a state
$s$ with $\sum_{a \in \mathit{Act}(s)} \bar{x}_a > 0$, for all $a' \in \mathit{Act}(s)$ the memoryless strategy plays $a'$
with probability $\bar{x}_{a'} / \sum_{a \in \mathit{Act}(s)} \bar{x}_a$. For a state $s$ with $\sum_{a \in \mathit{Act}(s)} \bar{x}_a = 0$, for all
$a' \in \mathit{Act}(s)$ the memoryless strategy plays $a'$ with probability $\bar{y}_{a'} / \sum_{a \in \mathit{Act}(s)} \bar{y}_a$.
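
The case distinction between recurrent and transient states translates directly into code. The Python sketch below builds the memoryless strategy from a solution given as two dictionaries; the function name and the uniform fallback for states whose `y` values are all zero (for instance unreachable states) are assumptions of this sketch.

```python
from typing import Dict, List


def memoryless_strategy(act: Dict[str, List[str]],
                        x: Dict[str, float],
                        y: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Build sigma(s)(a) from a solution: use x in recurrent states, y otherwise."""
    sigma = {}
    for s, enabled in act.items():
        freq = sum(x[a] for a in enabled)
        weights = x if freq > 0 else y   # recurrent iff total frequency is positive
        total = sum(weights[a] for a in enabled)
        if total == 0:
            # Neither x nor y puts mass here: pick uniformly (illustrative choice).
            sigma[s] = {a: 1.0 / len(enabled) for a in enabled}
        else:
            sigma[s] = {a: weights[a] / total for a in enabled}
    return sigma


act = {"s0": ["stay", "go"], "s1": ["loop"]}
x = {"stay": 0.0, "go": 0.0, "loop": 1.0}
y = {"stay": 0.0, "go": 2.0, "loop": 0.0}
sigma = memoryless_strategy(act, x, y)
```

In the example `s0` is transient, so the strategy follows the `y` values and always plays `go`; `s1` is recurrent and follows the `x` values.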

Solution from strategy. Consider a witness memoryless strategy $\sigma$ and consider the
Markov chain obtained by fixing the strategy. Let $X$ denote the set of recurrent
states in the Markov chain, and $Y = S \setminus X$ the set of transient states. The assignment
of the variables to satisfy the constraints of the mixed ILP is as follows: (a) For
states $s$ in $X$ and actions $a \in \mathit{Act}(s)$, the variable $x_a$ is assigned the long-run average
frequency of action $a$ in the Markov chain; and $y_s$ is assigned the probability that
the first state reached in the recurrent class containing $s$ is $s$. (b) For states $s$ in $Y$
and $a \in \mathit{Act}(s)$ we assign $x_a = 0$ and $y_a = \sigma(s)(a)$, the probability assigned by the
strategy $\sigma$.



